Predicting Titanic Survival

From the description of a Kaggle Machine Learning Challenge at https://www.kaggle.com/c/titanic

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In this demo we will use MLDB to train a classifier to predict whether a passenger would have survived the Titanic disaster.

Initializing pymldb and other imports

In this demo, we will use pymldb to interact with the REST API: see the Using pymldb Tutorial for more details.


In [12]:
from pymldb import Connection
mldb = Connection("http://localhost")

# we'll also need these later!
import numpy as np
import pandas as pd, matplotlib.pyplot as plt, seaborn, ipywidgets
%matplotlib inline

Checking out the Titanic dataset

From https://www.kaggle.com/c/titanic

Load up the data

See the Loading Data Tutorial for more details on how to get data into MLDB.


In [13]:
mldb.put('/v1/procedures/import_titanic_raw', { 
    "type": "import.text",
    "params": { 
        "dataFileUrl": "http://public.mldb.ai/titanic_train.csv",
        "outputDataset": "titanic_raw",
        "runOnCreation": True
    } 
})


Out[13]:
PUT http://localhost/v1/procedures/import_titanic_raw
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-12-16T15:50:08.7610252Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-12-16T15:50:08.804541Z", 
      "id": "2016-12-16T15:50:08.760858Z-463496b56263af05", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "titanic_raw", 
      "runOnCreation": true, 
      "dataFileUrl": "http://public.mldb.ai/titanic_train.csv"
    }, 
    "type": "import.text", 
    "id": "import_titanic_raw"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_titanic_raw"
}

Let's look at the data

See the Query API documentation for more details on SQL queries.


In [14]:
mldb.query("select * from titanic_raw limit 5")


Out[14]:
_rowName  Age  Embarked  Fare     Name                                                  Parch  PassengerId  Pclass  Sex     SibSp  Ticket            label  Cabin
2         22   S         7.2500   Braund, Mr. Owen Harris                               0      1            3       male    1      A/5 21171         0      None
3         38   C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Thayer)  0      2            1       female  1      PC 17599          1      C85
4         26   S         7.9250   Heikkinen, Miss. Laina                                0      3            3       female  0      STON/O2. 3101282  1      None
5         35   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)          0      4            1       female  1      113803            1      C123
6         35   S         8.0500   Allen, Mr. William Henry                              0      5            3       male    0      373450            0      None

As a first step in the modelling process, it is often very useful to look at summary statistics to get a sense of the data. To do so, we will create a Procedure of type summary.statistics and store the results in a new dataset called titanic_summary_stats:


In [15]:
print mldb.post("/v1/procedures", {
    "type": "summary.statistics",
    "params": {
        "inputData": "SELECT * FROM titanic_raw",
        "outputDataset": "titanic_summary_stats",
        "runOnCreation": True
    }
})


<Response [201]>

We can take a look at numerical columns:


In [16]:
mldb.query("""
    SELECT * EXCLUDING(value.most_frequent_items*) 
    FROM titanic_summary_stats 
    WHERE value.data_type='number'
""").transpose()


Out[16]:
_rowName            Fare     SibSp     PassengerId  label     Age      Pclass    Parch
value.1st_quartile  7.8958   0         223          0         20       2         0
value.3rd_quartile  31       1         669          1         38       3         0
value.avg           32.2042  0.523008  446          0.383838  29.6991  2.30864   0.381594
value.data_type     number   number    number       number    number   number    number
value.max           512.329  8         891          1         80       3         6
value.median        14.4542  0         446          0         28       3         0
value.min           0        0         1            0         0.42     1         0
value.num_null      0        0         0            0         177      0         0
value.num_unique    248      7         891          2         88       3         7
value.stddev        49.6934  1.10274   257.354      0.486592  14.5265  0.836071  0.806057
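
Note that Age is missing for 177 of the 891 passengers. As a quick sanity check, you can count them directly (a minimal sketch, assuming MLDB's SQL accepts IS NULL as used here):

mldb.query("select count(*) as missing_age from titanic_raw where Age is null")

We will nonetheless feed Age straight into the classifier below, missing values and all.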

Training a classifier

We will create another Procedure of type classifier.experiment. The configuration parameter defines a random forest: a bagged ensemble of decision trees in which each split considers a random 30% of the features (random_feature_propn: 0.3).


In [17]:
result = mldb.put('/v1/procedures/titanic_train_scorer', {
    "type": "classifier.experiment",
    "params": {
        "experimentName": "titanic",
        "inputData": """
            select 
                {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass} as features,
                label
            from titanic_raw
        """,
        "configuration": {
            "type": "bagging",
            "num_bags": 10,
            "validation_split": 0,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 10,
                "random_feature_propn": 0.3
            }
        },
        "kfold": 3,
        "modelFileUrlPattern": "file://models/titanic.cls",
        "keepArtifacts": True,
        "outputAccuracyDataset": True,
        "runOnCreation": True
    }
})

auc = np.mean([x["resultsTest"]["auc"] for x in result.json()["status"]["firstRun"]["status"]["folds"]])
print "\nArea under ROC curve = %0.4f\n" % auc


Area under ROC curve = 0.8311

We automatically get a REST API for predictions

The procedure above created for us a Function of type classifier.
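
Since MLDB functions are exposed over HTTP, any REST client can call the scorer, not just pymldb. Here is a minimal sketch using the requests library (the feature values are arbitrary; the input is passed as URL-encoded JSON, matching the GET request shown below):

import json, requests
features = {"Sex": "male", "Age": 40, "Fare": 50, "Embarked": "C",
            "Parch": 4, "Pclass": 2, "SibSp": 4}
r = requests.get("http://localhost/v1/functions/titanic_scorer_0/application",
                 params={"input": json.dumps({"features": features})})
print(r.json()["output"]["score"])

The interactive version below does the same thing through pymldb: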


In [18]:
@ipywidgets.interact
def score( Age=[0,80],Embarked=["C", "Q", "S"], Fare=[1,100], Parch=[0,8], Pclass=[1,3], 
            Sex=["male", "female"], SibSp=[0,8]):
    return mldb.get('/v1/functions/titanic_scorer_0/application', input={"features": locals()})


GET http://localhost/v1/functions/titanic_scorer_0/application?input=%7B%22features%22%3A+%7B%22Fare%22%3A+50%2C+%22Embarked%22%3A+%22C%22%2C+%22Age%22%3A+40%2C+%22Parch%22%3A+4%2C+%22Pclass%22%3A+2%2C+%22Sex%22%3A+%22male%22%2C+%22SibSp%22%3A+4%7D%7D
200 OK
{
  "output": {
    "score": 0.2905334234237671
  }
}

What's in a score?

Scores aren't probabilities, but they can be used to create binary classifiers by applying a cutoff threshold. MLDB's classifier.experiment procedure outputs a dataset which you can use to figure out where you want to set that threshold.
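
For example, here is a minimal sketch of turning a score into a yes/no prediction (the 0.5 cutoff is an arbitrary placeholder, not a recommendation):

CUTOFF = 0.5
r = mldb.get('/v1/functions/titanic_scorer_0/application',
             input={"features": {"Sex": "male", "Age": 40, "Fare": 50,
                                 "Embarked": "C", "Parch": 4, "Pclass": 2, "SibSp": 4}})
print(r.json()["output"]["score"] >= CUTOFF)  # False: the 0.29 score from above falls below the cutoff

The dataset below is what lets you choose that cutoff with your eyes open: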


In [19]:
test_results = mldb.query("select * from titanic_results_0 order by score desc")
test_results.head()


Out[19]:
_rowName  accuracy  falseNegatives  falsePositiveRate  falsePositives  index  label  precision  recall    score     trueNegatives  truePositiveRate  truePositives  weight
601       0.620072  106             0.000000           0               1      1      1.00       0.009346  0.781576  172            0.009346          1              1
558       0.623656  105             0.000000           0               2      1      1.00       0.018692  0.768337  172            0.018692          2              1
488       0.627240  104             0.000000           0               3      1      1.00       0.028037  0.741543  172            0.028037          3              1
700       0.623656  104             0.005814           1               4      0      0.75       0.028037  0.730499  171            0.028037          3              1
617       0.627240  103             0.005814           1               5      1      0.80       0.037383  0.730317  171            0.037383          4              1
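
Each row of this dataset describes the binary classifier you would get by using that row's score as the cutoff. For instance, a quick way to pick the cutoff that maximizes accuracy on this test set (a sketch only; optimizing on the test set gives an optimistic estimate):

best = test_results.loc[test_results["accuracy"].idxmax()]
print("cutoff = %.4f gives accuracy = %.4f" % (best.score, best.accuracy))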

Here's an interactive way to graphically explore the tradeoffs between the True Positive Rate and the False Positive Rate, using what's called a ROC curve.

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.


In [20]:
@ipywidgets.interact
def test_results_plot( threshold_index=[0,len(test_results)-1]):
    row = test_results.iloc[threshold_index]
    cols = ["trueNegatives","falsePositives","falseNegatives","truePositives",]
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    test_results.plot(ax=ax1, x="falsePositiveRate", y="truePositiveRate", 
    legend=False, title="ROC Curve, threshold=%.4f" % row.score).set_ylabel('truePositiveRate')
    ax1.plot(row.falsePositiveRate, row.truePositiveRate, 'gs')
    
    ax2.pie(row[cols], labels=cols, autopct='%1.1f%%', startangle = 90,
            colors=['lightskyblue','lightcoral','lightcoral', 'lightskyblue'])
    ax2.axis('equal')
    f.subplots_adjust(hspace=.75)
    plt.show()


But what is the model doing under the hood?

Let's create a function of type classifier.explain to help us understand what's happening here.


In [21]:
mldb.put('/v1/functions/titanic_explainer', { 
    "id": "titanic_explainer", 
    "type": "classifier.explain",
    "params": { "modelFileUrl": "file://models/titanic.cls" }
})


Out[21]:
PUT http://localhost/v1/functions/titanic_explainer
201 Created
{
  "status": {
    "mode": "regression", 
    "summary": "COMMITTEE"
  }, 
  "config": {
    "params": {
      "modelFileUrl": "file://models/titanic.cls"
    }, 
    "type": "classifier.explain", 
    "id": "titanic_explainer"
  }, 
  "state": "ok", 
  "type": "classifier.explain", 
  "id": "titanic_explainer"
}

Exploring the impact of features for a single example

NOTE: the interactive part of this demo only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.
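
If you are reading a static copy, here is a single non-interactive call to the explainer for one hypothetical passenger (the feature values are arbitrary):

x = mldb.get('/v1/functions/titanic_explainer/application',
             input={"features": {"Sex": "male", "Age": 40, "Fare": 50, "Embarked": "C",
                                 "Parch": 4, "Pclass": 2, "SibSp": 4},
                    "label": 1}).json()
print(x["output"]["bias"])         # the model's baseline score
print(x["output"]["explanation"])  # per-feature contributions to the score

The widget below plots these per-feature contributions cumulatively: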


In [22]:
@ipywidgets.interact
def sliders( Age=[0,80],Embarked=["C", "Q", "S"], Fare=[1,100], Parch=[0,8], Pclass=[1,3], 
            Sex=["male", "female"], SibSp=[0,8]):
    features = locals()
    x = mldb.get('/v1/functions/titanic_explainer/application', input={"features": features, "label": 1}).json()["output"]
   
    df = pd.DataFrame(
        {"%s=%s" % (feat, str(features[feat])): val for (feat, (val, ts)) in x["explanation"]}, 
        index=["val"]).transpose().cumsum()
    pd.DataFrame(
        {"cumulative score": [x["bias"]]+list(df.val)+[df.val[-1]]}, 
        index=['bias'] + list(df.index) + ['final']
    ).plot(kind='line', drawstyle='steps-post', legend=False, figsize=(15, 5), 
           ylim=(-1, 1), title="Score = %.4f" % df.val[-1]).set_ylabel('Cumulative Score')
    
    plt.show()


Summing up explanation values to get overall feature importance

When we sum up the explanation values in the context of the correct label, we can get an indication of how important each feature was to making a correct classification.


In [23]:
df = mldb.query("""
select label, sum(
    titanic_explainer({
        label: label, 
        features: {Sex, Age, Fare, Embarked, Parch, SibSp, Pclass}
    })[explanation]
) as *
from titanic_raw group by label
""")
df.set_index("label").transpose().plot(kind='bar', title="Feature Importance", figsize=(15, 5))
plt.xticks(rotation=0)
plt.show()


We can also load up a custom UI for this


In [24]:
mldb.put('/v1/plugins/pytanic', {
    "type":"python",
    "params": {"address": "git://github.com/datacratic/mldb-pytanic-plugin"}
})


Out[24]:
PUT http://localhost/v1/plugins/pytanic
201 Created
{
  "config": {
    "params": {
      "address": "git://github.com/datacratic/mldb-pytanic-plugin"
    }, 
    "type": "python", 
    "id": "pytanic"
  }, 
  "state": "ok", 
  "type": "python", 
  "id": "pytanic"
}

Now you can browse to the plugin UI.

NOTE: this only works if you're running this Notebook live, not if you're looking at a static copy on http://docs.mldb.ai. See the documentation for Running MLDB.

Where to next?

Check out the other Tutorials and Demos.